Modernize code generation for the external LLVM 22 back-end #3169

Merged: maleadt merged 13 commits into main from tb/ptx_llvm22 on Jun 8, 2026

@maleadt maleadt commented Jun 6, 2026

Machine code generation goes through an external LLVM 22 llc now, with the in-process LLVM only driving the middle end. That makes a bunch of old workarounds unnecessary, and unlocks some functionality:

  • llvm_compat reports the capabilities of NVPTX_LLVM_Backend_jll instead of the in-process LLVM, so PTX target selection isn't held back by the Julia-bundled version anymore. This adds sm_88 and sm_110 support.
  • llvm.-prefixed declarations the in-process LLVM doesn't recognize no longer trigger device-runtime linking; the back-end lowers them.
  • Dropped the fast min/max workaround for #2886 ("LLVM 18- generates non-existing min.NaN.f64/max.NaN.f64 instructions"), which is fixed in LLVM 21+.
  • Float16 atomic addition uses atomicrmw fadd instead of inline assembly, and BFloat16 gets native atomic add/sub on Julia 1.11+ (with a CAS fallback below sm_70 and sm_90, respectively).
  • active_mask() calls llvm.nvvm.activemask on LLVM 20+. The inline-assembly fallback for older versions is marked side-effecting, as it could previously be hoisted or merged across divergent control flow.
  • Fast Float32 exp2 uses the ex2.approx intrinsic.
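The CAS fallback mentioned in the atomics bullet above can be pictured as a compare-and-swap retry loop over the raw 16-bit pattern. The Python model below is only a sketch under assumptions: `Cell`, `compare_and_swap`, and `atomic_add_f16` are hypothetical names, Python's half-precision `struct` format stands in for device Float16, and nothing here reproduces the back-end's actual expansion.

```python
import struct

def f16_to_bits(x: float) -> int:
    """Round a Python float to half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bits_to_f16(b: int) -> float:
    """Interpret a 16-bit pattern as a half-precision float."""
    return struct.unpack("<e", struct.pack("<H", b))[0]

class Cell:
    """Hypothetical 16-bit memory cell exposing a compare-and-swap primitive."""
    def __init__(self, value: float):
        self.bits = f16_to_bits(value)

    def compare_and_swap(self, expected: int, desired: int) -> int:
        # Returns the value seen in memory; the store happens only on a match.
        old = self.bits
        if old == expected:
            self.bits = desired
        return old

def atomic_add_f16(cell: Cell, operand: float) -> float:
    """Emulate an atomic add by retrying CAS until no other writer interfered."""
    expected = cell.bits
    while True:
        desired = f16_to_bits(bits_to_f16(expected) + operand)
        seen = cell.compare_and_swap(expected, desired)
        if seen == expected:
            return bits_to_f16(seen)  # like atomicrmw, return the old value
        expected = seen  # lost the race: retry with the fresh value

cell = Cell(1.5)
old = atomic_add_f16(cell, 2.25)
```

In this single-threaded model the loop succeeds on the first attempt. The retry branch only matters when another writer changes the cell between the read and the swap.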

One caveat: llc recomputes the data layout from the triple, ignoring the module's, so 128-bit integers are always 16-byte aligned on the device. Julia only aligns them that way since 1.12, meaning aggregates with (U)Int128 fields may lay out differently on older hosts. Kernel arguments with such layout mismatches are now rejected with an error pointing at Julia 1.12.
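To make the layout mismatch concrete, here is a small Python model of C-style struct layout in which the alignment of 128-bit integers is a parameter. It is a sketch, not CUDA.jl's actual layout check: `field_offsets` and the example aggregate are hypothetical, and the 8-byte and 16-byte alignments are taken from the description in this PR.

```python
def field_offsets(fields, int128_align):
    """Return (offsets, total_size) for a C-style struct layout.

    `fields` is a list of (size, alignment) pairs; 128-bit integers are
    given as (16, None) so that their alignment is chosen by the caller.
    """
    offsets, offset, max_align = [], 0, 1
    for size, align in fields:
        align = int128_align if align is None else align
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets.append(offset)
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total

# Hypothetical aggregate: an Int64 field followed by a UInt128 field.
fields = [(8, 8), (16, None)]
host_pre_112 = field_offsets(fields, int128_align=8)   # alignment described for Julia before 1.12
device = field_offsets(fields, int128_align=16)        # alignment described for the back-end
```

Under these assumptions the second field sits at offset 8 on the host but at offset 16 on the device, and the aggregate grows from 24 to 32 bytes. That is the kind of disagreement the new error is meant to catch.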

Also includes a test guarding against dynamically-indexed aggregate arguments being copied to local memory (the regression fixed by llvm/llvm-project#201772), and updates the fdiv/rcp PTX tests for the new back-end's lowering (inv now selects rcp instructions, and fast Float64 division gets Newton refinement).
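For readers unfamiliar with the term, Newton refinement of a reciprocal starts from a low-precision estimate and improves it with a few multiply-add steps. The Python sketch below shows the textbook iteration only; `refine_reciprocal` is a hypothetical helper and this is not the instruction sequence the back-end emits.

```python
def refine_reciprocal(d: float, x0: float, steps: int = 1) -> float:
    """Improve an estimate x0 of 1/d using Newton's method on f(x) = 1/x - d."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - d * x)  # each step roughly squares the relative error
    return x

d = 3.0
approx = 0.33  # stand-in for a low-precision hardware estimate of 1/d
refined = refine_reciprocal(d, approx, steps=2)
```

Because the relative error is roughly squared per step, an estimate that is off by about 1% is accurate to about 1e-8 after two steps.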

maleadt added 9 commits June 5, 2026 20:31
Machine code is generated by an external, up-to-date LLVM, so target
selection should not be limited by the in-process LLVM version (which
only drives the middle end, and is not configured for any particular
device). This makes the back-end compile natively for recent devices,
e.g., sm_120a instead of sm_90 with a rewritten PTX header on
Blackwell, and unlocks newer PTX ISAs.

Intrinsics unknown to the in-process LLVM, e.g. selected by libdevice's
__CUDA_ARCH dispatch, were counted as undefined functions, needlessly
compiling for relocatable code and linking against cudadevrt.

Old LLVM back-ends generated nonexistent min.NaN/max.NaN instructions
for fast fp64 min/max and fp16 minimum/maximum (#2886); the external
back-end lowers these correctly for every subtarget.

Plain atomicrmw fadd gives LLVM real atomic semantics instead of an
opaque asm blob, generating the same instructions while remaining
optimizable; the back-end also expands it on devices without native
support. BFloat16 atomic addition is new (sm_90 hardware, expanded
elsewhere), and requires Julia 1.11 for bfloat codegen support.

The intrinsic has no side effects, unlike the inline assembly it
replaces, so it can be CSE'd, hoisted, and constant-folded.

The inline assembly lacked the side-effect flag, allowing LLVM to merge
or hoist it across divergent control flow. Use the convergent intrinsic
where available (LLVM 20+), and mark the assembly side-effecting on
older versions.

The back-end aligns 128-bit integers to 16 bytes, but Julia versions
before 1.12 align them to 8, so aggregates with (U)Int128 fields can
lay out differently on host and device. These used to be compiled
quietly, reading garbage on the device; error instead.
@maleadt changed the title from "Modernize for LLVM 22" to "Modernize code generation for the external LLVM 22 back-end" on Jun 6, 2026

@github-actions github-actions Bot left a comment

CUDA.jl Benchmarks

Details
| Benchmark suite | Current: 356c85b | Previous: 112549e | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100163 ns | 99517 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 75240 ns | 75910 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1628693 ns | 1594980 ns | 1.02 |
| array/accumulate/Float32/dims=2 | 140970 ns | 141259 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 652567 ns | 653724 ns | 1.00 |
| array/accumulate/Int64/1d | 118755 ns | 118852 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79140 ns | 79413 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1723746 ns | 1709492 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 153114 ns | 154250 ns | 0.99 |
| array/accumulate/Int64/dims=2L | 960242 ns | 960390 ns | 1.00 |
| array/broadcast | 18384 ns | 18461 ns | 1.00 |
| array/construct | 1197.5 ns | 1193 ns | 1.00 |
| array/copy | 16621 ns | 16550 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 213583 ns | 214764 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 278812 ns | 280613 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 10254 ns | 10344 ns | 0.99 |
| array/iteration/findall/bool | 133353 ns | 134100 ns | 0.99 |
| array/iteration/findall/int | 147614 ns | 147421 ns | 1.00 |
| array/iteration/findfirst/bool | 69959 ns | 112673 ns | 0.62 |
| array/iteration/findfirst/int | 71112 ns | 112820 ns | 0.63 |
| array/iteration/findmin/1d | 67998 ns | 67036 ns | 1.01 |
| array/iteration/findmin/2d | 101335 ns | 100960 ns | 1.00 |
| array/iteration/logical | 193754 ns | 193400 ns | 1.00 |
| array/iteration/scalar | 65567 ns | 64965 ns | 1.01 |
| array/permutedims/2d | 49616 ns | 49581 ns | 1.00 |
| array/permutedims/3d | 50731 ns | 50662 ns | 1.00 |
| array/permutedims/4d | 50885 ns | 50962 ns | 1.00 |
| array/random/rand/Float32 | 11550 ns | 12069 ns | 0.96 |
| array/random/rand/Int64 | 22788 ns | 24024 ns | 0.95 |
| array/random/rand!/Float32 | 7837.333333333333 ns | 8798.666666666666 ns | 0.89 |
| array/random/rand!/Int64 | 17838 ns | 20664 ns | 0.86 |
| array/random/randn/Float32 | 35484 ns | 35378 ns | 1.00 |
| array/random/randn!/Float32 | 23789 ns | 23654 ns | 1.01 |
| array/reductions/mapreduce/Float32/1d | 33624 ns | 33516 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 38432 ns | 38509 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 50250 ns | 50248 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56205 ns | 55822 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 67291 ns | 67519 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39237 ns | 39436 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 41561 ns | 41192 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 86410 ns | 86477 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 58727 ns | 57772 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=2L | 82371 ns | 83119 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 33454 ns | 33724 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 38555 ns | 38486 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 50180 ns | 50211 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56128 ns | 55852 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 66986 ns | 69022 ns | 0.97 |
| array/reductions/reduce/Int64/1d | 39115 ns | 39412 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 41248 ns | 40972 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 86521 ns | 86447 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 58247 ns | 57742 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2L | 83524 ns | 82671 ns | 1.01 |
| array/reverse/1d | 16824 ns | 16903 ns | 1.00 |
| array/reverse/1dL | 67720 ns | 67929 ns | 1.00 |
| array/reverse/1dL_inplace | 65187 ns | 65328 ns | 1.00 |
| array/reverse/1d_inplace | 8317.666666666666 ns | 10020.333333333334 ns | 0.83 |
| array/reverse/2d | 20065 ns | 20099 ns | 1.00 |
| array/reverse/2dL | 71781 ns | 71890 ns | 1.00 |
| array/reverse/2dL_inplace | 64950 ns | 65089 ns | 1.00 |
| array/reverse/2d_inplace | 9543 ns | 9724 ns | 0.98 |
| array/sorting/1d | 2654742 ns | 2658878 ns | 1.00 |
| array/sorting/2d | 1033402 ns | 1040327 ns | 0.99 |
| array/sorting/by | 3180132 ns | 3193494 ns | 1.00 |
| cuda/synchronization/context/auto | 1158.7 ns | 1122.1 ns | 1.03 |
| cuda/synchronization/context/blocking | 931.6296296296297 ns | 908.9714285714285 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 6052.6 ns | 6022.8 ns | 1.00 |
| cuda/synchronization/stream/auto | 989.4 ns | 993.9 ns | 1.00 |
| cuda/synchronization/stream/blocking | 837.8115942028985 ns | 827.3783783783783 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 5901.6 ns | 5915 ns | 1.00 |
| integration/byval/reference | 143146 ns | 142979 ns | 1.00 |
| integration/byval/slices=1 | 145285 ns | 145133 ns | 1.00 |
| integration/byval/slices=2 | 283812 ns | 283763 ns | 1.00 |
| integration/byval/slices=3 | 422279 ns | 422104 ns | 1.00 |
| integration/cudadevrt | 101563 ns | 101484 ns | 1.00 |
| integration/volumerhs | 8997466 ns | 9077118 ns | 0.99 |
| kernel/indexing | 12705 ns | 12734 ns | 1.00 |
| kernel/indexing_checked | 13427 ns | 13463 ns | 1.00 |
| kernel/launch | 2058 ns | 2146.3333333333335 ns | 0.96 |
| kernel/occupancy | 724.4328358208955 ns | 688.7569444444445 ns | 1.05 |
| kernel/rand | 15233 ns | 14254 ns | 1.07 |
| latency/import | 3850082067 ns | 3847133996 ns | 1.00 |
| latency/precompile | 4630800385 ns | 4625229019 ns | 1.00 |
| latency/ttfp | 4521745566 ns | 4496455873 ns | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.

maleadt and others added 4 commits June 6, 2026 18:02
The external back-end selects fast minnum/minimum to single min/max
instructions instead of compare + select, picking the NaN-propagating
variants where available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Julia's floating-point min/max follow IEEE 754-2019 minimum/maximum
semantics, which map directly onto these intrinsics. The external
back-end legalizes them on every subtarget (min.NaN/max.NaN on sm_80+,
a NaN/signed-zero-correct expansion elsewhere), so drop the manual
emulation based on __nv_fmin plus a NaN fix-up. That emulation also
inherited llvm.minnum's loose signed-zero semantics, causing constant
folding to break the -0.0 < +0.0 ordering on device.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
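The difference between IEEE 754-2019 `minimum` and a looser minnum-style operation, as discussed in the commit message above, can be sketched in Python. Both helpers are hypothetical models: `ieee_minimum` follows the NaN-propagating, signed-zero-aware rule, while `loose_minnum` shows one behavior a minnum-style operation may exhibit. Neither is the code LLVM or CUDA.jl runs.

```python
import math

def ieee_minimum(a: float, b: float) -> float:
    """IEEE 754-2019 `minimum`: propagates NaN and orders -0.0 below +0.0."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if a == b == 0.0:
        # Both operands are zeros: prefer the negative one.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b

def loose_minnum(a: float, b: float) -> float:
    """minnum-style model: a NaN operand is ignored, zero signs are not ordered."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a < b else b  # -0.0 < +0.0 is False, so the zero sign can be lost

strict = ieee_minimum(-0.0, 0.0)   # -0.0
loose = loose_minnum(-0.0, 0.0)    # +0.0 in this model
```

The two models disagree only on NaN operands and on mixed-sign zeros, which are the cases the commit message calls out.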
device_layout called sizeof on every zero-field DataType, but types
like Symbol don't have a definite size. Non-isbits arguments are passed
by reference, so their layout is Julia's business on both sides; treat
them as opaque. Only affected Julia 1.10/1.11, where the layout check
is active. Also add tests for the Int128 layout rejection itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Now that targets are selected based on the back-end LLVM, recent
devices compile natively (e.g. sm_120a) rather than for an older
baseline. Adjust the feature-set expectation to consult the back-end
version, and accept the wider vector accesses (v2.b64) such targets
prefer over v4.b32.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
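The effect of consulting the back-end LLVM for target selection can be illustrated with a small Python model that picks the newest supported target not exceeding the device. `select_target` is a hypothetical helper, and the capability lists are illustrative placeholders, not the real contents of either LLVM version.

```python
def select_target(device_cap, supported_caps):
    """Pick the newest supported target that does not exceed the device."""
    candidates = [cap for cap in supported_caps if cap <= device_cap]
    if not candidates:
        raise ValueError(f"no supported target for compute capability {device_cap}")
    return max(candidates)

# Illustrative capability sets only (major, minor); assumptions for this sketch.
older_llvm = [(7, 0), (8, 0), (9, 0)]
newer_llvm = older_llvm + [(10, 0), (12, 0)]
device = (12, 0)

baseline = select_target(device, older_llvm)  # falls back to an older baseline
native = select_target(device, newer_llvm)    # compiles natively for the device
```

With the shorter list, the same device is compiled for an older baseline. With the longer list it gets a native target, which mirrors the sm_90 versus sm_120a example in the commit messages.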

codecov Bot commented Jun 7, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (aa47d7a) to head (356c85b).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3169      +/-   ##
==========================================
- Coverage   16.33%   16.32%   -0.02%     
==========================================
  Files         124      124              
  Lines        9875     9875              
==========================================
- Hits         1613     1612       -1     
- Misses       8262     8263       +1     


@maleadt maleadt merged commit 1caa3ad into main Jun 8, 2026
2 checks passed
@maleadt maleadt deleted the tb/ptx_llvm22 branch June 8, 2026 04:51
maleadt added a commit to JohnCobbler/CUDA.jl that referenced this pull request Jun 11, 2026
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>